Abstract


Recent work on end-to-end neural network-based architectures for machine translation has shown promising results for En-Fr and En-De translation. Arguably, one of the major factors behind this success has been the availability of high quality parallel corpora. In this work, we investigate how to leverage abundant monolingual corpora for neural machine translation. Compared to a phrase-based and hierarchical baseline, we obtain up to 1.96 BLEU improvement on the low-resource language pair Turkish-English, and 1.59 BLEU on the focused domain task of Chinese-English chat messages. While our method was initially targeted toward such tasks with less parallel data, we show that it also extends to high resource languages such as Cs-En and De-En where we obtain an improvement of 0.39 and 0.47 BLEU scores over the neural machine translation baselines, respectively.


1 Introduction 


Neural machine translation (NMT) is a novel approach to machine translation that has shown promising results (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Cho et al., 2014; Bahdanau et al., 2014). Until recently, the application of neural networks to machine translation was restricted to extending standard machine translation tools for rescoring translation hypotheses or re-ranking n-best lists (see, e.g., Schwenk, 2012; Schwenk, 2007a). In contrast, it has been shown that it is possible to build a competitive translation system for English-French and English-German using an end-to-end neural network architecture (Sutskever et al., 2014; Jean et al., 2014) (also see Sec. 2).


Arguably, a large part of the recent success of these methods has been due to the availability of large amounts of high quality, sentence-aligned corpora. In the case of low-resource language pairs or in a task with heavy domain restrictions, such sentence-aligned corpora can be scarce. In contrast, monolingual corpora are almost always available. Despite being “unlabeled”, monolingual corpora still exhibit rich linguistic structure that may be useful for translation tasks. This presents an opportunity to leverage such corpora to give hints to an NMT system.


In this work, we present a way to effectively integrate a language model (LM) trained only on monolingual data (target language) into an NMT system. We provide experimental results showing that incorporating monolingual corpora can improve a translation system on a low-resource language pair (Turkish-English) and a domain-restricted translation problem (Chinese-English SMS chat). In addition, we show that these methods improve the performance on the relatively high-resource German-English (De-En) and Czech-English (Cs-En) translation tasks.

In the following section (Sec. 2), we review recent work in neural machine translation. We present our basic model architecture in Sec. 3 and describe our shallow and deep fusion approaches in Sec. 4. Next, we describe our datasets in Sec. 5. Finally, we describe our experimental settings and main results in Secs. 6 and 7.


























2 Background: Neural Machine Translation

Statistical machine translation (SMT) systems maximize the conditional probability $p(y \mid x)$ of a correct target translation $y$ given a source sentence $x$. This is done by separately maximizing a language model $p(y)$ and an (inverse) translation model $p(x \mid y)$, which are combined using Bayes’ rule:

$$p(y \mid x) \propto p(x \mid y)\, p(y).$$


This decomposition into a language model and translation model is meant to make full use of available corpora: monolingual corpora for fitting the language model and parallel corpora for the translation model. In reality, however, SMT systems tend to model $\log p(y \mid x)$ directly by linearly combining multiple features using a so-called log-linear model:

$$\log p(y \mid x) = \sum_j f_j(x, y) + C, \tag{1}$$

where $f_j$ is the $j$-th feature based on both or either of the source and target sentences, and $C$ is a normalization constant which is often ignored. These features include, for instance, pair-wise statistics between two sentences/phrases. The log-linear model is fitted to data, in most cases, by maximizing an automatic evaluation metric other than an actual conditional probability, such as BLEU.

Neural machine translation, on the other hand, aims at directly optimizing $\log p(y \mid x)$, including the feature extraction as well as the normalization constant, by a single neural network. This is typically done under the encoder-decoder framework (Kalchbrenner and Blunsom, 2013; Cho et al., 2014; Sutskever et al., 2014) consisting of neural networks. The first network encodes the source sentence $x$ into a continuous-space representation from which the decoder produces the target translation sentence. By using RNN architectures equipped to learn long term dependencies such as Gated Recurrent Units (GRU) or Long Short-Term Memory (LSTM), the whole system can be trained end-to-end (Cho et al., 2014; Sutskever et al., 2014).


Once the model learns the conditional distribution or translation model, given a source sentence we can find a translation that approximately maximizes the conditional probability using, for instance, a beam search algorithm.
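As an illustration of the approximate maximization mentioned above, the following is a minimal beam search sketch in Python. It is not the authors' decoder: `step_log_probs`, the `</s>` end symbol and `toy_model` are hypothetical stand-ins for a trained model that returns next-word log-probabilities given a target prefix.

```python
import math

def beam_search(step_log_probs, beam_size, max_len, eos="</s>"):
    """Return the highest-scoring hypothesis found by beam search.

    step_log_probs(prefix) is a hypothetical callable that returns a dict
    mapping each possible next word to log p(word | prefix, x).
    """
    beams = [((), 0.0)]  # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for word, lp in step_log_probs(prefix).items():
                candidates.append((prefix + (word,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            # Hypotheses ending in the end symbol leave the beam.
            if prefix[-1] == eos:
                finished.append((prefix, score))
            else:
                beams.append((prefix, score))
        if not beams:
            break
    finished.extend(beams)  # keep unfinished hypotheses if max_len was hit
    return max(finished, key=lambda c: c[1])

def toy_model(prefix):
    # Toy next-word distributions, purely for illustration.
    table = {
        (): {"a": math.log(0.6), "b": math.log(0.4)},
        ("a",): {"b": math.log(0.9), "</s>": math.log(0.1)},
        ("b",): {"</s>": math.log(1.0)},
        ("a", "b"): {"</s>": math.log(1.0)},
    }
    return table[prefix]

best, best_score = beam_search(toy_model, beam_size=2, max_len=4)
```

With a beam of size 2, the toy example keeps both one-word prefixes alive and ends up preferring the longer hypothesis, which a purely greedy search would also find here but is not guaranteed to find in general.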


3 Model Description 


We use the model recently proposed by (Bahdanau et al., 2014) that learns to jointly (soft-)align and translate as the baseline neural machine translation system in this paper. Here we describe in detail this model, to which we refer as “NMT”.

The encoder of the NMT is a bidirectional RNN which consists of forward and backward RNNs (Schuster and Paliwal, 1997). The forward RNN reads the input sequence/sentence $x = (x_1, \dots, x_T)$ in a forward direction, resulting in a sequence of hidden states $(\overrightarrow{h}_1, \dots, \overrightarrow{h}_T)$. The backward RNN reads $x$ in the opposite direction and outputs $(\overleftarrow{h}_1, \dots, \overleftarrow{h}_T)$. We concatenate the pair of hidden states at each time step to build a sequence of annotation vectors $(h_1, \dots, h_T)$, where

$$h_j^\top = \left[\overrightarrow{h}_j^\top ; \overleftarrow{h}_j^\top\right].$$

Each annotation vector $h_j$ encodes information about the $j$-th word with respect to all the other surrounding words in the sentence.

In our decoder, which we construct with a single-layer RNN, at each timestep $t$ a soft-alignment mechanism first decides which annotation vectors are most relevant. The relevance weight $\alpha_{tj}$ of the $j$-th annotation vector for the $t$-th target word is computed by a feedforward neural network $f$ that takes as input $h_j$, the previous decoder’s hidden state $s_{t-1}^{TM}$ and the previous output $y_{t-1}$:

$$e_{tj} = f(s_{t-1}^{TM}, h_j, y_{t-1}).$$

The outputs $e_{tj}$ are normalized over the sequence of the annotation vectors so that they sum to 1:

$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_{k=1}^{T} \exp(e_{tk})}, \tag{2}$$

and we call $\alpha_{tj}$ a relevance score, or an alignment weight, of the $j$-th annotation vector.

The relevance scores are used to get the context vector $c_t$ of the $t$-th word in the translation:

$$c_t = \sum_{j=1}^{T} \alpha_{tj} h_j.$$
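The normalization of Eq. (2) and the weighted sum that forms the context vector can be sketched in a few lines of plain Python. The scores and annotation vectors below are made-up toy values, and the function names are illustrative only; in the model the scores come from the feedforward network $f$.

```python
import math

def softmax(scores):
    """Normalize unnormalized scores e_tj into weights that sum to 1 (Eq. 2)."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def context_vector(scores, annotations):
    """Return (alpha_t, c_t), where c_t = sum_j alpha_tj * h_j."""
    alphas = softmax(scores)
    dim = len(annotations[0])
    c_t = [sum(a * h[d] for a, h in zip(alphas, annotations)) for d in range(dim)]
    return alphas, c_t

# Toy example: three source positions with 2-dimensional annotation vectors.
scores = [0.0, 0.0, math.log(2.0)]
annotations = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
alphas, c_t = context_vector(scores, annotations)
```

Here the third position receives half of the total weight, so the context vector is pulled toward its annotation vector.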


Then, the decoder’s hidden state $s_t^{TM}$ at time $t$ is computed based on the previous hidden state $s_{t-1}^{TM}$, the context vector $c_t$ and the previously translated word $y_{t-1}$:

$$s_t^{TM} = f_r(s_{t-1}^{TM}, y_{t-1}, c_t), \tag{3}$$

where $f_r$ is the gated recurrent unit (Cho et al., 2014).

We use a deep output layer (Pascanu et al., 2014) to compute the conditional distribution over words:

$$p(y_t \mid y_{<t}, x) \propto \exp\left(y_t^\top \left(W_o f_o(s_t^{TM}, y_{t-1}, c_t) + b_o\right)\right), \tag{4}$$

where $y_t$ is a one-hot encoded vector indicating one of the words in the target vocabulary. $W_o$ is a learned weight matrix and $b_o$ is a bias. $f_o$ is a single-layer feedforward neural network with a two-way maxout non-linearity (Goodfellow et al., 2013).


The whole model, including both the encoder and decoder, is jointly trained to maximize the (conditional) log-likelihood of the bilingual training corpus:

$$\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(y^{(n)} \mid x^{(n)}),$$

where the training corpus is a set of $(x^{(n)}, y^{(n)})$’s, and $\theta$ denotes a set of all the tunable parameters.

4 Integrating Language Model into the Decoder


In this paper, we propose two alternatives for integrating a language model into a neural machine translation system, which we refer to as shallow fusion (Sec. 4.1) and deep fusion (Sec. 4.2). Without loss of generality, we use a language model based on recurrent neural networks (RNNLM, (Mikolov et al., 2011)) which is equivalent to the decoder described in the previous section except that it is not biased by a context vector (i.e., $c_t = 0$ in Eqs. (3)-(4)).

In the sections that follow, we assume that both an NMT model (on parallel corpora) as well as a recurrent neural network language model (RNNLM, on larger monolingual corpora) have been pre-trained separately before being integrated. We denote the hidden state at time $t$ of the RNNLM with $s_t^{LM}$.

4.1 Shallow Fusion

Shallow fusion is analogous to how language models are used in the decoder of a usual SMT system (Koehn, 2010). At each time step, the translation model proposes a set of candidate words. The candidates are then scored according to the weighted sum of the scores given by the translation model and the language model.

More specifically, at each time step $t$, the translation model (in this case, the NMT) computes the score of every possible next word for each hypothesis in the set of hypotheses $\{y_{\le t-1}^{(i)}\}$. Each score is the summation of the score of the hypothesis and the score given by the NMT to the next word. All these new hypotheses (a hypothesis from the previous timestep with a next word appended at the end) are then sorted according to their respective scores, and the top $K$ ones are selected as candidates.

We then rescore these hypotheses with the weighted sum of the scores by the NMT and RNNLM, where we only need to recompute the score of the “new word” at the end of each candidate hypothesis. The score of the new word is computed by

$$\log p(y_t = k) = \log p_{TM}(y_t = k) + \beta \log p_{LM}(y_t = k), \tag{5}$$

where $\beta$ is a hyper-parameter that needs to be tuned to maximize the translation performance on a development set.

See Fig. 1(a) for illustration.

4.2 Deep Fusion

In deep fusion, we integrate the RNNLM and the decoder of the NMT by concatenating their hidden states next to each other (see Fig. 1(b)). The model is then finetuned to use the hidden states from both of these models when computing the output probability of the next word (see Eq. (4)). Unlike the vanilla NMT (without any language model component), the hidden layer of the deep output takes as input the hidden state of the RNNLM in addition to that of the NMT, the previous word and the context such that

$$p(y_t \mid y_{<t}, x) \propto \exp\left(y_t^\top \left(W_o f_o(s_t^{LM}, s_t^{TM}, y_{t-1}, c_t) + b_o\right)\right), \tag{6}$$

where again we use the superscripts LM and TM to denote the hidden states of the RNNLM and NMT respectively.

During the finetuning of the model, we tune only the parameters that were used to parameterize the output (Eq. (6)). This is to ensure that the structure learned by the LM from monolingual corpora is not overwritten. It is possible to use monolingual corpora as well while finetuning all the parameters, but in this paper, we alter only the output parameters in the stage of finetuning.

Figure 1: Graphical illustrations of the proposed fusion methods. (a) Shallow Fusion (Sec. 4.1). (b) Deep Fusion (Sec. 4.2).
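To make the shallow fusion rescoring of Sec. 4.1 concrete, the sketch below implements one decoding step in plain Python: the translation model alone proposes the top $K$ candidates, and only the newly appended word is rescored with the weighted sum of TM and LM log-probabilities. The callables `tm_log_probs` and `lm_log_probs` and the toy distributions are hypothetical stand-ins for the trained NMT and RNNLM, not the authors' implementation.

```python
import math

def shallow_fusion_step(hypotheses, tm_log_probs, lm_log_probs, beta, k):
    """One beam-search step with shallow fusion.

    hypotheses: list of (prefix tuple, cumulative score).
    tm_log_probs(prefix), lm_log_probs(prefix): hypothetical callables that
    return a dict mapping each next word to its log-probability.
    """
    # 1) The translation model alone proposes and ranks the candidates.
    candidates = []
    for prefix, score in hypotheses:
        for word, tm_lp in tm_log_probs(prefix).items():
            candidates.append((prefix, word, score, tm_lp))
    candidates.sort(key=lambda c: c[2] + c[3], reverse=True)
    top_k = candidates[:k]

    # 2) Rescore only the new word: log p_TM + beta * log p_LM.
    rescored = []
    for prefix, word, score, tm_lp in top_k:
        new_word_score = tm_lp + beta * lm_log_probs(prefix)[word]
        rescored.append((prefix + (word,), score + new_word_score))
    rescored.sort(key=lambda h: h[1], reverse=True)
    return rescored

# Toy distributions: the TM slightly prefers "cot", the LM strongly prefers "cat".
def toy_tm(prefix):
    return {"cot": math.log(0.55), "cat": math.log(0.45)}

def toy_lm(prefix):
    return {"cot": math.log(0.1), "cat": math.log(0.9)}

start = [((), 0.0)]
without_lm = shallow_fusion_step(start, toy_tm, toy_lm, beta=0.0, k=2)
with_lm = shallow_fusion_step(start, toy_tm, toy_lm, beta=0.1, k=2)
```

In this toy case a small weight on the LM is enough to flip the ranking of the two candidates, which is the effect shallow fusion relies on. Note that the LM can only reorder candidates the TM already placed in the top $K$.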

4.2.1 Balancing the LM and TM 

In order for the decoder to flexibly balance the input from the LM and TM, we augment the decoder with a “controller” mechanism. The need to flexibly balance the signals arises depending on the word being translated. For instance, in the case of Zh-En, there are no Chinese words that correspond to articles in English, in which case the LM may be more informative. On the other hand, if a noun is to be translated, it may be better to ignore any signal from the LM, as it may prevent the decoder from choosing the correct translation. Intuitively, this mechanism helps the model dynamically weight the different models depending on the word being translated.

The controller mechanism is implemented as a function taking the hidden state of the LM as input and computing

$$g_t = \sigma\left(v_g^\top s_t^{LM} + b_g\right), \tag{7}$$

where $\sigma$ is a logistic sigmoid function. $v_g$ and $b_g$ are learned parameters.

The output of the controller is then multiplied with the hidden state of the LM. This lets the decoder use the signal from the TM fully, while the controller controls the magnitude of the LM signal.

In our experiments, we empirically found that it was better to initialize the bias $b_g$ to a small, negative number. This allows the decoder to decide the importance of the LM only when it is deemed necessary.[1]
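The controller of Eq. (7) is small enough to sketch directly. The following plain-Python example computes the gate and the gated LM state; the vector values are toy numbers, and the zero weight vector is chosen only to isolate the effect of the bias.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gated_lm_state(s_lm, v_g, b_g):
    """Compute g_t = sigmoid(v_g . s_lm + b_g) and the gated LM state g_t * s_lm."""
    g_t = sigmoid(sum(v * s for v, s in zip(v_g, s_lm)) + b_g)
    return g_t, [g_t * s for s in s_lm]

# Toy LM hidden state; with zero weights the gate depends on the bias alone.
s_lm = [0.5, -1.0, 2.0]
v_g = [0.0, 0.0, 0.0]
g_t, gated = gated_lm_state(s_lm, v_g, b_g=-1.0)
```

With $b_g = -1$ and no contribution from the weights, the gate evaluates to $\sigma(-1) \approx 0.27$, so the LM state enters the output layer strongly attenuated until finetuning learns to open the gate.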

5 Datasets 

We evaluate the proposed approaches on four diverse tasks: Chinese to English (Zh-En), Turkish to English (Tr-En), German to English (De-En) and Czech to English (Cs-En). We describe each of these datasets in more detail below.

5.1 Parallel Corpora 

5.1.1 Zh-En: OpenMT’15 

We use the parallel corpora made available as a part of the NIST OpenMT’15 Challenge. Sentence-aligned pairs from three domains are combined to form a training set: (1) SMS/CHAT and (2) conversational telephone speech (CTS) from the DARPA BOLT Project, and (3) newsgroups/weblogs from the DARPA GALE Project. In total, the training set consists of 430K sentence pairs (see Table 1 for the detailed statistics). We train models with this training set and the development set (the concatenation of the provided development and tune sets from the challenge), and evaluate them on the test set. The domain of the development and test sets is restricted to CTS.

[1] In all our experiments, we set $b_g = -1$ to ensure that $g_t$ is initially 0.26 on average.


Preprocessing Importantly, we did not segment the Chinese sentences and considered each character as a symbol, unlike other approaches which use a separate segmentation tool to segment the Chinese characters into words (Devlin et al., 2014). Any run of consecutive non-Chinese characters, such as Latin letters, was however considered an individual word. Lastly, we removed any HTML/XML tags from the corpus, chose only the intended-meaning word if both intended and literal translations were available, and ignored any indicator of, e.g., typos. The only preprocessing we did on the English side of the corpus was a simple tokenization using the tokenizer from Moses.[2]

5.1.2 Tr-En: IWSLT’14 


We used the WIT3 parallel corpus (Cettolo et al., 2012) and the SETimes parallel corpus made available as a part of IWSLT’14 (machine translation track). The corpus consists of the sentence-aligned subtitles of TED and TEDx talks, and we concatenated dev2010 and tst2010 to form a development set, and tst2011, tst2012, tst2013 and tst2014 to form a test set. See Table 1 for the detailed statistics of the parallel corpora.


Preprocessing As with Zh-En, we first removed all special symbols from the corpora and tokenized the Turkish side with the tokenizer provided by Moses. To overcome the exploding vocabulary due to the rich inflections and derivations in Turkish, we segmented each Turkish sentence into a sequence of sub-word units using Zemberek[3] followed by morphological disambiguation on the morphological analysis (Sak et al., 2007). We removed any non-surface morphemes corresponding to, for instance, part-of-speech tags.


5.2 Cs-En and De-En: WMT’15 

For the training of our models, we used all the available training data provided for Cs-En and De-En in the WMT’15 competition. We used newstest2013 as a development set and newstest2014 as a test set. The detailed statistics of the parallel corpora are provided in Table 1.

[2] https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl

[3] https://github.com/ahmetaa/zemberek-nlp


Preprocessing We first tokenized the datasets with the Moses tokenizer. Sentences longer than eighty words and those with a large mismatch between the lengths of the source and target sentences were removed from the training set. Then, we filtered the training data by removing sentence pairs in which one sentence (or both) was written in the wrong language, using a language detection toolkit (Shuyo, 2010), unless the sentence had 5 words or fewer. For De-En, we also split the compounds on the German side using Moses. Finally, we shuffled the training corpora seven times and concatenated the outputs.
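The filtering rules described above can be sketched as a single predicate over a tokenized sentence pair. Several details here are assumptions rather than the authors' settings: the text does not state the length-mismatch threshold, so `max_ratio=2.0` is a placeholder, and `detect` is a hypothetical callable standing in for the language detection toolkit (the toy detector below exists only to make the example runnable).

```python
def keep_pair(src, tgt, src_lang, tgt_lang, detect, max_len=80, max_ratio=2.0):
    """Return True if a tokenized sentence pair passes the training filters.

    max_ratio is an assumed threshold for the length mismatch; detect is a
    hypothetical callable mapping a string to a language code.
    """
    if not src or not tgt:
        return False
    if len(src) > max_len or len(tgt) > max_len:
        return False
    longer, shorter = max(len(src), len(tgt)), min(len(src), len(tgt))
    if longer / shorter > max_ratio:
        return False
    # Language check is skipped for sentences of 5 words or fewer.
    for tokens, expected in ((src, src_lang), (tgt, tgt_lang)):
        if len(tokens) > 5 and detect(" ".join(tokens)) != expected:
            return False
    return True

def toy_detect(text):
    # Crude stand-in for a real language detector, for illustration only.
    words = text.lower().split()
    return "de" if any(w in ("der", "die", "das") for w in words) else "en"

good = keep_pair("das ist der beste Film des Jahres".split(),
                 "this is the best film of the year".split(),
                 "de", "en", toy_detect)
wrong_language = keep_pair("this is clearly not a German sentence".split(),
                           "this is clearly not a German sentence".split(),
                           "de", "en", toy_detect)
short_pair = keep_pair(["ok"], ["ok"], "de", "en", toy_detect)
too_long = keep_pair(["das"] * 81, ["the"] * 81, "de", "en", toy_detect)
mismatched = keep_pair(["das"] * 3, ["the"] * 10, "de", "en", toy_detect)
```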


5.3 Monolingual Corpora 

The English Gigaword corpus by the Linguistic Data Consortium (LDC), which mainly consists of newswire documents, was allowed in both the OpenMT’15 and IWSLT’15 challenges for language modelling. We used the tokenized Gigaword corpus without any other preprocessing step to train three different RNNLMs to fuse into NMT for Zh-En, Tr-En and the WMT’15 translation tasks (De-En and Cs-En).


6 Settings 

6.1 Training Procedure 

6.1.1 Neural Machine Translation 

The input and output of the network were sequences of one-hot vectors whose dimensionalities correspond to the sizes of the source and target vocabularies, respectively. We constructed the vocabularies with the most common words in the parallel corpora. The sizes of the vocabularies for Chinese, Turkish and English were 10K, 30K and 40K, respectively, for the Tr-En and Zh-En tasks. Each word was first projected into a 620-dimensional continuous space to reduce the dimensionality, on both the encoder and the decoder. We chose the size of the recurrent units for Zh-En and Tr-En to be 1,200 and 1,000 respectively.

In Cs-En and De-En experiments, we were able to use larger vocabularies. We trained our NMT model for Cs-En and De-En with large vocabularies using the importance sampling based technique introduced in (Jean et al., 2014), and with this technique we were able to use a large vocabulary of size 200K.

(a) Zh-En
                      Chinese    English
# of Sentences             436K
# of Unique Words     21K        150K
# of Total Words      8.4M       5.9M
Avg. Length           19.3       13.5

(b) Tr-En
                      Turkish    English
# of Sentences             160K
# of Unique Words     96K*       95K
# of Total Words      11.4M*     8.1M
Avg. Length           31.6       22.6

(c) Cs-En
                      Czech      English
# of Sentences             12.1M
# of Unique Words     1.5M       911K
# of Total Words      151M       172M
Avg. Length           12.5       14.2

(d) De-En
                      German     English
# of Sentences             4.1M
# of Unique Words     1.16M†     742K
# of Total Words      11.?M†     8.1M
Avg. Length           24.2       25.1

Table 1: Statistics of the Parallel Corpora. *: After segmentation, †: After compound splitting.

Each model was optimized using Adadelta (Zeiler, 2012) with minibatches of 80 sentence pairs. At each update, we normalized the gradient such that if the $L_2$ norm of the gradient exceeds 5, the gradient is renormalized back to 5 (Pascanu et al., 2013). For the non-recurrent layers (see Eq. (4)), we used dropout (Hinton et al., 2012) and additive Gaussian noise (mean 0 and std. dev. 0.001) on each parameter to prevent overfitting (Graves, 2011). Training was early-stopped to maximize the performance on the development set measured by BLEU.[4] We initialized all recurrent weight matrices as random orthonormal matrices.
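The gradient renormalization described above amounts to rescaling the whole gradient whenever its $L_2$ norm exceeds the threshold. A minimal sketch over a flat list of gradient values follows; the function name and the flat-list representation are illustrative choices, not taken from the authors' code.

```python
import math

def clip_by_global_norm(grads, threshold=5.0):
    """Rescale grads so their L2 norm is at most threshold.

    Returns the (possibly rescaled) gradient and its original norm.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in grads], norm
    return list(grads), norm

clipped, norm_before = clip_by_global_norm([6.0, 8.0])    # norm 10 -> rescaled
unchanged, small_norm = clip_by_global_norm([3.0, 4.0])   # norm 5 -> untouched
```

Because every component is multiplied by the same factor, the direction of the update is preserved and only its magnitude is capped.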

6.1.2 Language Model 

We trained three RNNLMs. The first two, each with 2,400 long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) units, were trained on the English Gigaword Corpus using the vocabularies constructed separately from the English sides of the Zh-En and Tr-En corpora, respectively. The third language model was trained using 2,000 LSTM units, again on the English Gigaword Corpus, but with a vocabulary constructed from the intersection of the English sides of Cs-En and De-En. The parameters of the former two language models were optimized using RMSProp (Tieleman and Hinton, 2012), and the Adam optimizer (Kingma and Ba, 2014) was used for the latter one. Any sentence with more than ten percent of its words out of vocabulary was discarded from the training set. We did early-stopping using the perplexity on the development set.

[4] We compute the BLEU score using the multi-bleu.perl script from Moses on tokenized sentence pairs.
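The out-of-vocabulary filter applied to the language model training data can be sketched as follows. The function name and the toy vocabulary are illustrative; the handling of empty sentences (dropped here) is an assumption, since the text does not mention them.

```python
def keep_sentence(tokens, vocab, max_oov_ratio=0.1):
    """Keep a sentence unless more than max_oov_ratio of its words are OOV."""
    if not tokens:
        return False  # assumption: empty sentences are dropped
    oov = sum(1 for t in tokens if t not in vocab)
    return oov / len(tokens) <= max_oov_ratio

vocab = {"the", "cat", "sat", "on", "mat", "a", "dog", "ran", "to", "it"}
one_oov = "the cat sat on the mat a dog ran zzz".split()      # 1 of 10 words OOV
two_oov = "the cat sat on the mat a dog qqq zzz".split()      # 2 of 10 words OOV
kept = keep_sentence(one_oov, vocab)
dropped = keep_sentence(two_oov, vocab)
```

A sentence with exactly ten percent unknown words is kept, since only sentences with more than ten percent are discarded.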

6.2 Shallow and Deep Fusion 
6.2.1 Shallow Fusion 

The hyperparameter $\beta$ in Eq. (5) was selected to maximize the translation performance on the development set, from the range between 0.001 and 0.1. In preliminary experiments, we found it important to renormalize the softmax of the LM without the end-of-sequence and out-of-vocabulary symbols ($\beta = 0$ in Eq. (5)). This may be due to the difference in the domains of the TM and LM.
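One way to read the renormalization described above is sketched below: the end-of-sequence and out-of-vocabulary entries are removed from the LM distribution and the remaining probabilities are rescaled to sum to 1, so that the LM contributes nothing for the removed symbols. This is an interpretation under stated assumptions, not the authors' code, and the symbol names `</s>` and `<unk>` are hypothetical.

```python
import math

def renormalize_without(log_probs, excluded=("</s>", "<unk>")):
    """Drop the excluded symbols and renormalize the remaining log-probabilities."""
    kept = {w: lp for w, lp in log_probs.items() if w not in excluded}
    log_z = math.log(sum(math.exp(lp) for lp in kept.values()))
    return {w: lp - log_z for w, lp in kept.items()}

lm_log_probs = {
    "the": math.log(0.4),
    "cat": math.log(0.2),
    "</s>": math.log(0.3),
    "<unk>": math.log(0.1),
}
renormalized = renormalize_without(lm_log_probs)
```

In the toy distribution the two remaining words keep their 2:1 ratio but now account for all of the probability mass.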


6.2.2 Deep Fusion 

We finetuned the parameters of the deep output layer (Eq. (6)) as well as the controller (see Eq. (7)) using the Adam optimizer for Zh-En, and RMSProp with momentum for Tr-En. During the finetuning, the dropout probability and the standard deviation of the weight noise were set to 0.56 and 0.005, respectively. Based on our preliminary experiments, we reduced the level of regularization after the first 10K updates. In Cs-En and De-En tasks with large vocabularies, the model parameters were finetuned using Adadelta while scaling down the magnitude of the update steps by 0.01.


6.2.3 Handling Rare Words 

On the De-En and Cs-En translation tasks, we replaced each unknown word generated by the NMT with the source word to which the NMT assigned the highest alignment score (Eq. (2)). We copied the selected source word in the place of the corresponding unknown token in the target sentence. This method is similar to the technique proposed by (Luong et al., 2014) for addressing rare words. But instead of relying on an external alignment tool, we used the attention mechanism of the NMT model to extract alignments. This method consistently improved the results by approximately 1.0 BLEU score.
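The unknown-word replacement can be sketched as a short post-processing function. It assumes the decoder exposes, for every target position, the list of alignment weights over source positions; the `<unk>` symbol and the toy sentence pair are illustrative.

```python
def replace_unknowns(target_tokens, source_tokens, alignments, unk="<unk>"):
    """Copy the most-attended source word in place of each unknown target token.

    alignments[t][j] is the alignment weight of source position j for
    target position t.
    """
    output = []
    for t, word in enumerate(target_tokens):
        if word == unk:
            weights = alignments[t]
            j = max(range(len(source_tokens)), key=lambda idx: weights[idx])
            output.append(source_tokens[j])
        else:
            output.append(word)
    return output

source = ["Herr", "Schmidt", "spricht"]
target = ["Mr.", "<unk>", "speaks"]
alignments = [
    [0.8, 0.1, 0.1],
    [0.1, 0.7, 0.2],
    [0.1, 0.2, 0.7],
]
fixed = replace_unknowns(target, source, alignments)
```

Copying works well for names and numbers, which are often identical in both languages; other rare words would be copied untranslated.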



7 Results and Analysis

7.1 Zh-En: OpenMT’15

In addition to NMT-based systems, we also trained phrase-based as well as hierarchical phrase-based SMT systems (Koehn et al., 2003; Chiang, 2005) with/without re-scoring by an external neural language model (CSLM) (Schwenk, 2007b). We present the results in Table 2.

We observed that integrating an additional LM by deep fusion (see Sec. 4.2) helped the models achieve better performance in general, except in the case of the CTS task. We noticed that the NMT-based models, regardless of whether the LM was integrated or not, outperformed the more traditional phrase-based SMT systems.

              SMS/CHAT           CTS
              Dev      Test      Dev      Test
PB            15.5     14.73     21.94    21.68
  +CSLM       16.02    15.25     23.05    22.79
HPB           15.33    14.71     21.45    21.43
  +CSLM       15.93    15.8      22.61    22.17
NMT           17.32    17.36     23.4     23.59
Shallow       16.59    16.42     22.7     22.83
Deep          17.58    17.64     23.78    23.5

Table 2: Results on the task of Zh-En. PB and HPB stand for the phrase-based and hierarchical phrase-based SMT systems, respectively.

7.2 Tr-En: IWSLT’14

In Table 3, we present our results on Tr-En. Compared to Zh-En, we saw a greater performance improvement of up to +1.19 BLEU points from the basic NMT to the NMT integrated with the LM under the proposed method of deep fusion. Furthermore, by incorporating the LM using deep fusion, the NMT systems were able to outperform the best previously reported result (Yılmaz et al., 2013) by up to +1.96 BLEU points on all of the separate test sets.

7.3 Cs-En and De-En: WMT’15

We provide the results of Cs-En and De-En in Table 4. Shallow fusion achieved 0.09 and 0.29 BLEU score improvements respectively on De-En and Cs-En over the baseline NMT model. With deep fusion, improvements of 0.39 and 0.47 BLEU score were observed, again over the NMT baseline.

                   De-En              Cs-En
                   Dev      Test      Dev      Test
NMT Baseline       25.51    23.61     21.47    21.89
Shallow Fusion     25.53    23.69     21.95    22.18
Deep Fusion        25.88    24.00     22.49    22.36

Table 4: Results for De-En and Cs-En translation tasks on WMT’15 dataset.


7.4 Analysis: Effect of Language Model 


The performance improvements we report in this paper reflect a heavy dependency on the degree of similarity between the domain of monolingual corpora and the target domain of translation.

In the case of Zh-En, intuitively, we can tell that the style of writing in both SMS/CHAT as well as the conversational speech will be different from that of news articles (which constitute the majority of the English Gigaword corpus). Empirically, this is supported by the high perplexity on the development set with our LM (see the column Zh-En of Table 5). This explains the marginal improvement we observed in Sec. 7.1.

On the other hand, in the case of Tr-En, the similarity between the domains of the monolingual corpus and parallel corpora is higher (see the column Tr-En of Table 5). This led to a significantly larger improvement in translation performance by integrating the external language model than in the case of Zh-En. Similarly, we observed improvements by both shallow and deep fusion in the case of De-En and Cs-En, where the perplexity on the development set was much lower.

Unlike shallow fusion, deep fusion allows a model to selectively incorporate the information from the additional LM by the controller mechanism from Sec. 4.2.1. Although this controller mechanism works on a per-word basis, it can be expected that if the additional LM models the target domain better, the controller mechanism will be more frequently active on average, i.e., $\mathbb{E}[g_t] > 0$. From Table 5, we can see that, on average, the controller mechanism is most active with De-En and Cs-En, where the additional LM was able to model the target sentences best. This effectively means that deep fusion allows the model to be more robust to the domain mismatch between the TM and LM, which suggests why deep fusion was more successful than shallow fusion in the experiments.

                               Development Set       Test Set
                               dev2010   tst2010     tst2011   tst2012   tst2013   tst2014
Previous Best (Single)         15.33     17.14       18.77     18.62     18.88     -
Previous Best (Combination)    -         17.34       18.83     18.93     18.70     -
NMT                            14.50     18.01       18.40     18.77     19.86     18.64
NMT+LM (Shallow)               14.44     17.99       18.48     18.80     19.87     18.66
NMT+LM (Deep)                  15.69     19.34       20.17     20.23     21.34     20.56

Table 3: Results on Tr-En. We show the results for each set separately to make it easier to compare to previously reported scores.



              Zh-En     Tr-En     De-En    Cs-En
Perplexity    223.68    163.73    78.20    78.20
Average g     0.23      0.12      0.28     0.31
Std Dev g     0.0009    0.02      0.003    0.008

Table 5: Perplexity of RNNLMs on development sets and the statistics of the controller gating mechanism g.

8 Conclusion and Future Work 

In this paper, we propose and compare two methods for incorporating monolingual corpora into an existing NMT system. We empirically evaluate these approaches (shallow fusion and deep fusion) on low-resource Tr-En (TED/TEDx subtitles), focused domain Zh-En (SMS/Chat and conversational speech) and two high-resource language pairs: Cs-En and De-En. We show that with our approach on the Tr-En and Zh-En language pairs, the NMT models trained with deep fusion were able to achieve better results than the existing phrase-based statistical machine translation systems (up to +1.96 BLEU points on Tr-En). We also observed up to a 0.47 BLEU score improvement for high-resource language pairs such as De-En and Cs-En on the datasets provided in the WMT’15 competition over our NMT baseline. This provides evidence that our method can also improve the translation performance regardless of the amount of available parallel corpora.

Our analysis also revealed that the performance improvement from incorporating an external LM was highly dependent on the domain similarity between the monolingual corpus and the target task. In the case where the domains of the bilingual and monolingual corpora were similar (De-En, Cs-En), we observed improvement with both deep and shallow fusion methods. In the case where they were dissimilar (Zh-En), the improvements using shallow fusion were much smaller. This trend might also explain why deep fusion, which implements an adaptive mechanism for modulating information from the integrated LM, works better than shallow fusion. This analysis also suggests that future work on domain adaptation of the language model may further improve translations.